Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze customer data to identify the customers who will leave its credit card services and the reasons for doing so, so that the bank can improve in those areas.
We need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
#
# Loading Necessary Libraries
#
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(
color_codes=True
) # -----This adds a background color to all the plots created using seaborn
# Allow the use of Display via interactive Python
from IPython.display import display
# Import tabulate, a library used for creating tables in a visually appealing format
from tabulate import tabulate
# Import library for exploratory visualization of missing data.
import missingno as ms
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating & personalizing pipelines
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from imblearn.pipeline import Pipeline as imb_Pipeline
from imblearn.pipeline import make_pipeline as make_imb_pipeline
# ---To be used to transform User Defined Functions into a transformer function
from sklearn.preprocessing import FunctionTransformer
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import RandomOverSampler
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation in dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# Making the Python code more structured automatically
%load_ext nb_black
print("Loading Libraries... Done.")
Loading Libraries... Done.
# Loading Dataset
data_path = "BankChurners.csv"
data = pd.read_csv(data_path)
# Making a copy of the data to avoid any changes to original data
df = data.copy()
print("Loading Dataset... Done.")
Loading Dataset... Done.
# Checking the top 5, bottom 5 and 10 random rows
display(df.head()) # -----looking at head (top 5 observations)
display(df.tail()) # -----looking at tail (bottom 5 observations)
display(
df.sample(10, random_state=1)
) # -----10 random sample of observations from the data
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | 3 | 2 | 3 | 4003.000 | 1851 | 2152.000 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.000 | 2186 | 2091.000 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | 5 | 3 | 4 | 5409.000 | 0 | 5409.000 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 5281.000 | 0 | 5281.000 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | 6 | 2 | 4 | 10388.000 | 1961 | 8427.000 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6498 | 712389108 | Existing Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Blue | 36 | 6 | 3 | 2 | 2570.000 | 2107 | 463.000 | 0.651 | 4058 | 83 | 0.766 | 0.820 |
| 9013 | 718388733 | Existing Customer | 38 | F | 1 | College | NaN | Less than $40K | Blue | 32 | 2 | 3 | 3 | 2609.000 | 1259 | 1350.000 | 0.871 | 8677 | 96 | 0.627 | 0.483 |
| 2053 | 710109633 | Existing Customer | 39 | M | 2 | College | Married | $60K - $80K | Blue | 31 | 6 | 3 | 2 | 9871.000 | 1061 | 8810.000 | 0.545 | 1683 | 34 | 0.478 | 0.107 |
| 3211 | 717331758 | Existing Customer | 44 | M | 4 | Graduate | Married | $120K + | Blue | 32 | 6 | 3 | 4 | 34516.000 | 2517 | 31999.000 | 0.765 | 4228 | 83 | 0.596 | 0.073 |
| 5559 | 709460883 | Attrited Customer | 38 | F | 2 | Doctorate | Married | Less than $40K | Blue | 28 | 5 | 2 | 4 | 1614.000 | 0 | 1614.000 | 0.609 | 2437 | 46 | 0.438 | 0.000 |
| 6106 | 789105183 | Existing Customer | 54 | M | 3 | Post-Graduate | Single | $80K - $120K | Silver | 42 | 3 | 1 | 2 | 34516.000 | 2488 | 32028.000 | 0.552 | 4401 | 87 | 0.776 | 0.072 |
| 4150 | 771342183 | Attrited Customer | 53 | F | 3 | Graduate | Single | $40K - $60K | Blue | 40 | 6 | 3 | 2 | 1625.000 | 0 | 1625.000 | 0.689 | 2314 | 43 | 0.433 | 0.000 |
| 2205 | 708174708 | Existing Customer | 38 | M | 4 | Graduate | Married | $40K - $60K | Blue | 27 | 6 | 2 | 4 | 5535.000 | 1276 | 4259.000 | 0.636 | 1764 | 38 | 0.900 | 0.231 |
| 4145 | 718076733 | Existing Customer | 43 | M | 1 | Graduate | Single | $60K - $80K | Silver | 31 | 4 | 3 | 3 | 25824.000 | 1170 | 24654.000 | 0.684 | 3101 | 73 | 0.780 | 0.045 |
| 5324 | 821889858 | Attrited Customer | 50 | F | 1 | Doctorate | Single | abc | Blue | 46 | 6 | 4 | 3 | 1970.000 | 1477 | 493.000 | 0.662 | 2493 | 44 | 0.571 | 0.750 |
Observations
We will drop the CLIENTNUM column. It adds no value to our analysis or models.
Attrition_Flag is our target variable; it will be converted to 0s and 1s.
Income_Category has an entry with value 'abc' (e.g., record 5324). We need to investigate this further.
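Before the full EDA below, the size of the 'abc' problem can be gauged with a simple boolean sum. A minimal sketch on a toy series (the values here are illustrative; the same expression applies to df["Income_Category"]):

```python
import pandas as pd

# Toy stand-in for df["Income_Category"]; values are illustrative
s = pd.Series(["$40K - $60K", "abc", "Less than $40K", "abc", "$120K +"])

n_corrupt = (s == "abc").sum()  # number of 'abc' entries
print(n_corrupt)  # → 2
```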
# -----Print the dimension of the data
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns")
There are 10127 rows and 21 columns
# -----Displaying information about features of the Dataset
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
Observations
There are missing values in Education_Level & Marital_Status.
For Numerical Variables
# -----Displaying Statistical Summary of Numerical Data
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.000 | 739177606.334 | 36903783.450 | 708082083.000 | 713036770.500 | 717926358.000 | 773143533.000 | 828343083.000 |
| Customer_Age | 10127.000 | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.000 | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.000 | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
Observations
The average age of a customer is about 46; the youngest and oldest customers are 26 and 73 respectively.
The average Dependent_count is 2.35; the customers with the most dependents have 5.
The average Months_on_book is about 36 months, with a minimum of 13 months and a maximum of 56 months.
The average Total_Trans_Amt is about 4,404, with minimum and maximum total transaction amounts of 510 & 18,484 respectively.
The average Total_Trans_Ct is about 65, with minimum and maximum transaction counts of 10 & 139 respectively.
The minimum of Months_Inactive_12_mon is 0, i.e., some customers did not have a single month of inactivity in the last 12 months.
# -----Displaying the Summary of Categorical Data
df.describe(include=["object"]).T
| count | unique | top | freq | |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 10127 | 6 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
# Get the Categorical Variables (Object types)
cat_cols = df.select_dtypes(["object"])
# Check the unique values of the categorical variables
for i in cat_cols.columns:
print("Unique values % in", i, "are :")
print(cat_cols[i].value_counts(normalize=True) * 100)
print("*" * 50)
print("\n")
Unique values % in Attrition_Flag are : Existing Customer 83.934 Attrited Customer 16.066 Name: Attrition_Flag, dtype: float64 ************************************************** Unique values % in Gender are : F 52.908 M 47.092 Name: Gender, dtype: float64 ************************************************** Unique values % in Education_Level are : Graduate 36.338 High School 23.385 Uneducated 17.275 College 11.768 Post-Graduate 5.994 Doctorate 5.239 Name: Education_Level, dtype: float64 ************************************************** Unique values % in Marital_Status are : Married 49.979 Single 42.045 Divorced 7.976 Name: Marital_Status, dtype: float64 ************************************************** Unique values % in Income_Category are : Less than $40K 35.163 $40K - $60K 17.676 $80K - $120K 15.157 $60K - $80K 13.844 abc 10.981 $120K + 7.179 Name: Income_Category, dtype: float64 ************************************************** Unique values % in Card_Category are : Blue 93.177 Silver 5.480 Gold 1.145 Platinum 0.197 Name: Card_Category, dtype: float64 **************************************************
Observations
There are about 53% Female and 47% Male customers.
Income_Category has the value "abc" in about 11% of records. This will be fixed.
# Checking missing values across each column
c_missing = pd.Series(df.isnull().sum(), name="Missing Count") # -----Count Missing
p_missing = pd.Series(
round(df.isnull().sum() / df.shape[0] * 100, 2), name="% Missing"
) # -----Percentage Missing
# Combine into 1 Dataframe
missing_df = pd.concat([c_missing, p_missing], axis=1)
# # Display missing info
# display(missing_df)
missing_df.sort_values(by="% Missing", ascending=False).style.background_gradient(
cmap="YlOrRd"
)
| Missing Count | % Missing | |
|---|---|---|
| Education_Level | 1519 | 15.000000 |
| Marital_Status | 749 | 7.400000 |
| CLIENTNUM | 0 | 0.000000 |
| Contacts_Count_12_mon | 0 | 0.000000 |
| Total_Ct_Chng_Q4_Q1 | 0 | 0.000000 |
| Total_Trans_Ct | 0 | 0.000000 |
| Total_Trans_Amt | 0 | 0.000000 |
| Total_Amt_Chng_Q4_Q1 | 0 | 0.000000 |
| Avg_Open_To_Buy | 0 | 0.000000 |
| Total_Revolving_Bal | 0 | 0.000000 |
| Credit_Limit | 0 | 0.000000 |
| Total_Relationship_Count | 0 | 0.000000 |
| Months_Inactive_12_mon | 0 | 0.000000 |
| Attrition_Flag | 0 | 0.000000 |
| Months_on_book | 0 | 0.000000 |
| Card_Category | 0 | 0.000000 |
| Income_Category | 0 | 0.000000 |
| Dependent_count | 0 | 0.000000 |
| Gender | 0 | 0.000000 |
| Customer_Age | 0 | 0.000000 |
| Avg_Utilization_Ratio | 0 | 0.000000 |
# Visual Exploration of Missing Values
# Plot missing values across each columns
plt.title("Missing Values Graph", fontsize=20)
ms.bar(df)
<AxesSubplot:title={'center':'Missing Values Graph'}>
Observations
Education_Level has 15% missing values. Marital_Status has 7.4% missing values.
# Checking for duplicate records
df.duplicated().sum()
0
Observations
To prevent data leakage, we are going to make a copy of the dataframe specifically for creating our models.
We will treat the corrupt Income_Category entries containing "abc" as NaN and then use SimpleImputer to replace the missing values with the mode. This is strictly for EDA purposes.
The model copy (model_df) will be used for Model Building and treated differently after splitting, to prevent data leakage.
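The split-then-treat idea for the model copy can be sketched as follows. This is a minimal example on a toy frame that reuses this dataset's column names (the toy values themselves are made up); the point is that the imputer is fit on the training fold only and merely applied to the test fold:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-in for model_df (column names from this dataset; values are made up)
toy = pd.DataFrame(
    {
        "Education_Level": ["Graduate", np.nan, "High School", "Graduate", "Uneducated", "College"],
        "Marital_Status": ["Married", "Single", np.nan, "Married", "Single", "Married"],
        "Income_Category": ["abc", "$40K - $60K", "Less than $40K", "$60K - $80K", "Less than $40K", "$80K - $120K"],
        "Attrition_Flag": [0, 1, 0, 1, 0, 0],
    }
)

X = toy.drop(columns="Attrition_Flag").replace("abc", np.nan)
y = toy["Attrition_Flag"]

# Split FIRST, so no statistic of the test fold influences the imputation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2, random_state=1)

imputer = ColumnTransformer(
    [
        (
            "impute",
            SimpleImputer(strategy="most_frequent"),
            ["Education_Level", "Marital_Status", "Income_Category"],
        )
    ]
)
imputer.fit(X_train)  # learns the modes of the training fold only
X_test_imputed = imputer.transform(X_test)  # applied, not re-fit, on the test fold
```

The same pattern extends naturally to the scalers, encoders, and resamplers imported above once they are wrapped in a Pipeline.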
model_df = df.copy() # Make a copy of the dataframe to be used for building the models
eda_df = df.copy() # Make a copy of the dataframe to be used for EDA.
# Replacing the corrupt data containing 'abc' with NaN
eda_df.replace("abc", np.nan, inplace=True)
# Creating an instance of the imputer to be used
imputer = SimpleImputer(strategy="most_frequent")
cols_for_impute = ["Education_Level", "Marital_Status", "Income_Category"]
# Fit & transform the imputer
eda_df[cols_for_impute] = imputer.fit_transform(eda_df[cols_for_impute])
# Create a numerical representation of the Target variable (0s & 1s) for EDA purposes
eda_df["Attrition_Flag_01"] = eda_df["Attrition_Flag"].apply(
lambda x: 1 if x == "Attrited Customer" else 0
)
# Show that the EDA copy of the dataset has been treated while the Model copy is still intact.
print("From EDA Copy of Dataset:\n")
print(eda_df["Income_Category"].value_counts(normalize=True) * 100)
display(eda_df.isna().sum())
print("\nFrom Model Copy of Dataset:\n")
print(model_df["Income_Category"].value_counts(normalize=True) * 100)
display(model_df.isna().sum())
From EDA Copy of Dataset: Less than $40K 46.144 $40K - $60K 17.676 $80K - $120K 15.157 $60K - $80K 13.844 $120K + 7.179 Name: Income_Category, dtype: float64
CLIENTNUM 0 Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 Attrition_Flag_01 0 dtype: int64
From Model Copy of Dataset: Less than $40K 35.163 $40K - $60K 17.676 $80K - $120K 15.157 $60K - $80K 13.844 abc 10.981 $120K + 7.179 Name: Income_Category, dtype: float64
CLIENTNUM 0 Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
Observations
# CLIENTNUM consists of unique client IDs and hence will not add value to the modeling
eda_df.drop(["CLIENTNUM"], axis=1, inplace=True)
User Defined Functions
# -----
# User defined function to plot labeled_barplot
# -----
def labeled_barplot(data, feature, perc=False, v_ticks=True, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default is False)
    v_ticks: whether to rotate the x-axis tick labels vertically (default is True)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    if v_ticks:
        plt.xticks(rotation=90)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x position: center of the bar
        y = p.get_height()  # y position: top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage above the bar
    plt.show()  # show the plot
# -----
# User defined function to print the 5 point summary and histogram, box plot,
# and cumulative density distribution plots
# -----
def summary(data, x):
    """
    Print the 5 point summary and plot the histogram, box plot,
    and cumulative density distribution for the feature name
    passed as the argument.

    Parameters:
    ----------
    data: dataframe
    x: str, feature name

    Usage:
    ------------
    summary(eda_df, 'Customer_Age')
    """
    x_min = data[x].min()
    x_max = data[x].max()
    Q1 = data[x].quantile(0.25)
    Q2 = data[x].quantile(0.50)
    Q3 = data[x].quantile(0.75)
    five_point = {"Min": x_min, "Q1": Q1, "Q2": Q2, "Q3": Q3, "Max": x_max}
    ldf = pd.DataFrame(data=five_point, index=["Value"])
    print(f"5 Point Summary of {x.capitalize()} Attribute:\n")
    print(tabulate(ldf, headers="keys", tablefmt="psql"))
    fig, axs = plt.subplots(nrows=3, ncols=1, figsize=(16, 22))
    sns.set_palette("Pastel1")
    # Histogram with mean (dashed) and median (solid) reference lines
    ax1 = sns.distplot(data[x], color="purple", ax=axs[0])
    ax1.axvline(np.mean(data[x]), color="purple", linestyle="--")
    ax1.axvline(np.median(data[x]), color="black", linestyle="-")
    ax1.set_title(f"{x.capitalize()} Density Distribution")
    # Boxplot
    ax2 = sns.boxplot(
        x=data[x], palette="cool", width=0.7, linewidth=0.6, showmeans=True, ax=axs[1]
    )
    ax2.set_title(f"{x.capitalize()} Boxplot")
    # Cumulative plot
    ax3 = sns.kdeplot(data[x], cumulative=True, linewidth=1.5, ax=axs[2])
    ax3.set_title(f"{x.capitalize()} Cumulative Density Distribution")
    plt.subplots_adjust(hspace=0.4)
    plt.show()
# -----
# User defined function to plot stacked bar chart
# -----
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 100)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    # Place the legend outside the plot area
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()
# -----
# User defined function to plot both kde & boxplot of predictor variable wrt target
# -----
def kde_boxplot_wrt_target(data, predictor, target):
    """
    Plot a boxplot and a KDE plot of the predictor variable
    with respect to the target variable.
    """
    # Create the Boxplot
    plt.figure(figsize=(15, 5))
    sns.boxplot(data=data, x=target, y=predictor, showmeans=True)
    plt.tight_layout()
    plt.show()
    # Create the KDE plot with hue
    sns.kdeplot(
        data=data,
        x=predictor,
        hue=target,
        fill=True,
    )
    # Add axis labels
    plt.xlabel(predictor)
    plt.ylabel("Density")
    plt.show()
Customer_Age
# -----Plot Histogram, Box plot and Cumulative Plot
summary(eda_df, "Customer_Age")
5 Point Summary of Customer_age Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 26 | 41 | 46 | 52 | 73 | +-------+-------+------+------+------+-------+
Observations
Months_on_book
# -----Plot Histogram, Box plot and Cumulative Plot
summary(eda_df, "Months_on_book")
5 Point Summary of Months_on_book Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 13 | 31 | 36 | 40 | 56 | +-------+-------+------+------+------+-------+
Observations
Credit_Limit
# -----Plot Histogram, Box plot and Cumulative Plot
summary(eda_df, "Credit_Limit")
5 Point Summary of Credit_limit Attribute: +-------+--------+------+------+---------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+--------+------+------+---------+-------| | Value | 1438.3 | 2555 | 4549 | 11067.5 | 34516 | +-------+--------+------+------+---------+-------+
Total_Revolving_Bal
# -----Plot Histogram, Box plot and Cumulative Plot
summary(eda_df, "Total_Revolving_Bal")
5 Point Summary of Total_revolving_bal Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 0 | 359 | 1276 | 1784 | 2517 | +-------+-------+------+------+------+-------+
Observations
Avg_Open_To_Buy
# -----Plot Histogram, Box plot and Cumulative Plot
summary(eda_df, "Avg_Open_To_Buy")
5 Point Summary of Avg_open_to_buy Attribute: +-------+-------+--------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+--------+------+------+-------| | Value | 3 | 1324.5 | 3474 | 9859 | 34516 | +-------+-------+--------+------+------+-------+
Observations
Total_Trans_Ct
# -----Plot Histogram, Box plot and Cumulative Plot
summary(eda_df, "Total_Trans_Ct")
5 Point Summary of Total_trans_ct Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 10 | 45 | 67 | 81 | 139 | +-------+-------+------+------+------+-------+
Observations
Total_Amt_Chng_Q4_Q1
# -----Plot Histogram, Box plot and Cumulative Plot
summary(eda_df, "Total_Amt_Chng_Q4_Q1")
5 Point Summary of Total_amt_chng_q4_q1 Attribute: +-------+-------+-------+-------+-------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+-------+-------+-------+-------| | Value | 0 | 0.631 | 0.736 | 0.859 | 3.397 | +-------+-------+-------+-------+-------+-------+
Observations
Let's see how the total transaction amount is distributed.
Total_Trans_Amt
# -----Plot Histogram, Box plot and Cumulative Plot
summary(eda_df, "Total_Trans_Amt")
5 Point Summary of Total_trans_amt Attribute: +-------+-------+--------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+--------+------+------+-------| | Value | 510 | 2155.5 | 3899 | 4741 | 18484 | +-------+-------+--------+------+------+-------+
Observations
Total_Ct_Chng_Q4_Q1
# -----Plot Histogram, Box plot and Cumulative Plot
summary(eda_df, "Total_Ct_Chng_Q4_Q1")
5 Point Summary of Total_ct_chng_q4_q1 Attribute: +-------+-------+-------+-------+-------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+-------+-------+-------+-------| | Value | 0 | 0.582 | 0.702 | 0.818 | 3.714 | +-------+-------+-------+-------+-------+-------+
Observations
Avg_Utilization_Ratio
# -----Plot Histogram, Box plot and Cumulative Plot
summary(eda_df, "Avg_Utilization_Ratio")
5 Point Summary of Avg_utilization_ratio Attribute: +-------+-------+-------+-------+-------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+-------+-------+-------+-------| | Value | 0 | 0.023 | 0.176 | 0.503 | 0.999 | +-------+-------+-------+-------+-------+-------+
Observations
Dependent_count
# -----Call the labeled_barplot function to plot the graph
labeled_barplot(eda_df, "Dependent_count", True, False)
Observations
Total_Relationship_Count
# -----Call the labeled_barplot function to plot the graph
labeled_barplot(eda_df, "Total_Relationship_Count", True, False)
Observations
Months_Inactive_12_mon
# -----Call the labeled_barplot function to plot the graph
labeled_barplot(eda_df, "Months_Inactive_12_mon", True, False)
Observations
Contacts_Count_12_mon
# -----Call the labeled_barplot function to plot the graph
labeled_barplot(eda_df, "Contacts_Count_12_mon", True, False)
Observations
Gender
# -----Call the labeled_barplot function to plot the graph
labeled_barplot(eda_df, "Gender", True, False)
Observations
Education_Level
# -----Call the labeled_barplot function to plot the graph
labeled_barplot(eda_df, "Education_Level", True, True)
Observations
Marital_Status
# -----Call the labeled_barplot function to plot the graph
labeled_barplot(eda_df, "Marital_Status", True, False)
Observations
Income_Category
# -----Call the labeled_barplot function to plot the graph
labeled_barplot(eda_df, "Income_Category", True, True)
Observations
The largest income group is Less than $40K, followed by $40K - $60K.
Card_Category
# -----Call the labeled_barplot function to plot the graph
labeled_barplot(eda_df, "Card_Category", True, False)
Observations
plt.figure(figsize=(15, 7))
sns.heatmap(eda_df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Observations
Avg_Open_To_Buy and Credit_Limit have a perfect positive linear relationship, which suggests multicollinearity. Most of the ML algorithms we are going to use are not affected by multicollinearity.
A high Avg_Open_To_Buy relative to Credit_Limit can mean:
◎ Customers are not using their cards.
◎ Customers pay off their credit cards quickly.
Avg_Open_To_Buy and Avg_Utilization_Ratio have a negative correlation, as expected.
Customer_Age and Months_on_book have a high correlation. This is to be expected.
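The relationship noted above is consistent with the accounting identity Avg_Open_To_Buy = Credit_Limit − Total_Revolving_Bal, which can be sanity-checked row by row. A minimal sketch using three rows copied from the head of the data displayed earlier (the same check can be run on the full eda_df):

```python
import numpy as np
import pandas as pd

# Three rows copied from the head of the data displayed above
toy = pd.DataFrame(
    {
        "Credit_Limit": [12691.0, 8256.0, 3418.0],
        "Total_Revolving_Bal": [777, 864, 0],
        "Avg_Open_To_Buy": [11914.0, 7392.0, 3418.0],
    }
)

# If the identity holds, Credit_Limit and Avg_Open_To_Buy differ only by the
# revolving balance, whose spread (0 to 2,517) is small next to the spread of
# Credit_Limit (1,438 to 34,516) -- hence the near-perfect positive correlation.
identity_holds = np.allclose(
    toy["Credit_Limit"] - toy["Total_Revolving_Bal"], toy["Avg_Open_To_Buy"]
)
print(identity_holds)  # → True
```

If a linear model were used later, one of the two columns could reasonably be dropped; for the tree ensembles listed above this redundancy is less of a concern.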
Total_Trans_Amt is highly correlated with Total_Trans_Ct, because the total amount usually gets higher as the count of transactions grows.
# We will draw a pair plot of interesting numerical features from the correlation matrix
features_for_pairplot = [
"Customer_Age",
"Months_on_book",
"Credit_Limit",
"Total_Revolving_Bal",
"Avg_Open_To_Buy",
"Total_Trans_Ct",
"Avg_Utilization_Ratio",
"Attrition_Flag_01",
]
# Create a pairplot
sns.pairplot(eda_df[features_for_pairplot], hue="Attrition_Flag_01")
# Display the plot
plt.show()
Observations
Total_Trans_Ct for the Attrited Customers is lower across the board, i.e., against Customer_Age, Months_on_book, Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, & Avg_Utilization_Ratio.
Attrition_Flag vs Gender
# Stacked Barplot of Attrition_Flag in comparison to Gender
stacked_barplot(eda_df, "Gender", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Gender All 1627 8500 10127 F 930 4428 5358 M 697 4072 4769 ----------------------------------------------------------------------------------------------------
Observations
Attrition_Flag vs Marital_Status
# Stacked Barplot of Attrition_Flag in comparison to Marital_Status
stacked_barplot(eda_df, "Marital_Status", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Marital_Status All 1627 8500 10127 Married 838 4598 5436 Single 668 3275 3943 Divorced 121 627 748 ----------------------------------------------------------------------------------------------------
Observations
Attrition_Flag vs Education_Level
# Stacked Barplot of Attrition_Flag in comparison to Education_Level
stacked_barplot(eda_df, "Education_Level", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Education_Level All 1627 8500 10127 Graduate 743 3904 4647 High School 306 1707 2013 Uneducated 237 1250 1487 College 154 859 1013 Doctorate 95 356 451 Post-Graduate 92 424 516 ----------------------------------------------------------------------------------------------------
Observations
Attrition_Flag vs Income_Category
# Stacked Barplot of Attrition_Flag in comparison to Income_Category
stacked_barplot(eda_df, "Income_Category", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Income_Category All 1627 8500 10127 Less than $40K 799 3874 4673 $40K - $60K 271 1519 1790 $80K - $120K 242 1293 1535 $60K - $80K 189 1213 1402 $120K + 126 601 727 ----------------------------------------------------------------------------------------------------
Observations
Customers at the highest level of income ($120K +) and the lowest level of income (Less than $40K) attrited the most.
Attrition_Flag vs Total_Relationship_Count
# Stacked Barplot of Attrition_Flag in comparison to Total_Relationship_Count
stacked_barplot(eda_df, "Total_Relationship_Count", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Total_Relationship_Count All 1627 8500 10127 3 400 1905 2305 2 346 897 1243 1 233 677 910 5 227 1664 1891 4 225 1687 1912 6 196 1670 1866 ----------------------------------------------------------------------------------------------------
Observations
### Total_Revolving_Bal vs Attrition_Flag
# KDE & Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Total_Revolving_Bal", "Attrition_Flag")
Observations
Attrition appears to be higher among customers with a lower Total_Revolving_Bal.

### Attrition_Flag vs Credit_Limit
# KDE & Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Credit_Limit", "Attrition_Flag")
Observations
Credit_Limit does not appear to affect Attrition.

### Attrition_Flag vs Customer_Age
# KDE & Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Customer_Age", "Attrition_Flag")
Observations
Customer_Age does not appear to affect Attrition.

### Total_Trans_Ct vs Attrition_Flag
# KDE & Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Total_Trans_Ct", "Attrition_Flag")
Observations
Attrited customers tend to have a lower Total_Trans_Ct.

### Total_Trans_Amt vs Attrition_Flag
# KDE & Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Total_Trans_Amt", "Attrition_Flag")
Observations
Attrited customers tend to have a lower Total_Trans_Amt.

### Avg_Utilization_Ratio vs Attrition_Flag
# KDE & Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Avg_Utilization_Ratio", "Attrition_Flag")
Observations
Attrited customers tend to have a lower Avg_Utilization_Ratio.

### Attrition_Flag vs Months_on_book
# KDE & Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Months_on_book", "Attrition_Flag")
Observations
Months_on_book does not appear to affect Attrition.

### Attrition_Flag vs Avg_Open_To_Buy
# KDE & Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Avg_Open_To_Buy", "Attrition_Flag")
Observations
Avg_Open_To_Buy does not appear to affect Attrition.

# Creating a list of numerical variables
num_features = [
    "Customer_Age",
    "Months_on_book",
    "Total_Relationship_Count",
    "Months_Inactive_12_mon",
    "Contacts_Count_12_mon",
    "Credit_Limit",
    "Total_Revolving_Bal",
    "Avg_Open_To_Buy",
    "Total_Amt_Chng_Q4_Q1",
    "Total_Trans_Amt",
    "Total_Trans_Ct",
    "Total_Ct_Chng_Q4_Q1",
    "Avg_Utilization_Ratio",
]
# Creating a list of categorical variables
cat_features = [
    "Gender",
    "Dependent_count",
    "Education_Level",
    "Marital_Status",
    "Income_Category",
    "Card_Category",
]
# CLIENTNUM consists of unique IDs for clients and hence will not add value to the modeling
model_df.drop(["CLIENTNUM"], axis=1, inplace=True)
# Replacing the corrupt data containing 'abc' with NaN
model_df.replace("abc", np.nan, inplace=True)
As observed in the EDA, the outliers are values that are plausible for the bank and its customers.
Given the business problem and domain knowledge, these are "valid outliers", so we will not drop them: the data looks normal for the domain and no outlier treatment is required.
Instead, we will use log transformations to reduce the negative effect of the outliers on our models.
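As a minimal sketch of why the log transform tames outliers, consider a hypothetical skewed column: the extreme value dominates the raw scale, but sits much closer to the rest after the same `log(x + 1)` transform applied in the next cell (`np.log1p` computes exactly `log(x + 1)`).

```python
import numpy as np

# Hypothetical skewed feature: most values are modest, one is extreme
x = np.array([1_000.0, 2_000.0, 3_000.0, 34_000.0])

# On the raw scale the outlier is 13.6x the median...
print(x.max() / np.median(x))

# ...after log(x + 1) it is only ~1.3x the median, so it drags the model far less
x_log = np.log1p(x)
print(x_log.max() / np.median(x_log))
```

The relative spread shrinks from an order of magnitude to a small multiple, which is the effect we rely on instead of dropping the rows.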
# -----
# Perform a log transformation of the numerical columns
# -----
# Creating a copy of the dataset so we always have a copy of the untreated dataset
model_untreated_df = model_df.copy()
# using log transforms
for col in num_features:
    model_df[col] = np.log(model_df[col] + 1)
# -----
# Minmax scaling numeric features
# -----
for col in num_features:
    model_df[col] = MinMaxScaler().fit_transform(model_df[[col]])
# Dividing dataset into X and y
X = model_df.drop(["Attrition_Flag"], axis=1)
y = model_df["Attrition_Flag"].apply(lambda x: 1 if x == "Attrited Customer" else 0)
# Splitting data into training, validation and test set:
# first we split data into 2 parts; temporary and test
# then we split temporary into; train and val
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 19) (2026, 19) (2026, 19)
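As a sanity check of the split arithmetic (on hypothetical toy data): holding out 20% for test and then 25% of the remainder for validation yields a 60/20/20 split overall, with stratification preserving the class ratio in every part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 900 negatives, 100 positives
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)

# Stage 1: 20% of the whole goes to test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
# Stage 2: 25% of the remaining 80% goes to validation (0.25 * 0.8 = 0.2 of the whole)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))        # 600 200 200
print(y_train.sum(), y_val.sum(), y_test.sum())     # 60 20 20 (class ratio preserved)
```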
# Show the number of missing values in the feature set.
display(X_train.isna().sum())
print("-" * 30)
display(X_val.isna().sum())
print("-" * 30)
display(X_test.isna().sum())
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 928 Marital_Status 457 Income_Category 654 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
------------------------------
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 294 Marital_Status 140 Income_Category 221 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
------------------------------
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 297 Marital_Status 152 Income_Category 237 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
To prevent data leakage, we fit the imputer on the training set only, then apply the learned transform to the training, validation and test sets separately. That is, what is learned from the training data is imputed across the board (train, val & test), so the validation and test sets remain 'unseen' during fitting.
# creating an instance of the imputer to be used
imputer = SimpleImputer(strategy="most_frequent")
cols_for_impute = ["Education_Level", "Marital_Status", "Income_Category"]
# Fit and transform the train data
X_train[cols_for_impute] = imputer.fit_transform(X_train[cols_for_impute])
# Transform the validation data
X_val[cols_for_impute] = imputer.transform(X_val[cols_for_impute])
# Transform the test data
X_test[cols_for_impute] = imputer.transform(X_test[cols_for_impute])
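A minimal sketch of this fit-on-train / transform-everywhere pattern, using hypothetical toy frames: the mode learned from the training column ("Graduate") is what fills the gap in the validation column, even though the validation column's own mode is different.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits with a missing category in each
train = pd.DataFrame({"Education_Level": ["Graduate", "Graduate", "College", np.nan]})
val = pd.DataFrame({"Education_Level": ["College", "College", np.nan]})

imp = SimpleImputer(strategy="most_frequent")

# Fit on train only; the imputer memorizes the training mode ("Graduate")
train[["Education_Level"]] = imp.fit_transform(train[["Education_Level"]])

# Transform val with the training mode, NOT val's own mode ("College")
val[["Education_Level"]] = imp.transform(val[["Education_Level"]])
print(val["Education_Level"].tolist())  # ['College', 'College', 'Graduate']
```

This is exactly why the validation and test sets stay "unseen": no statistic of theirs ever influences the imputation.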
# Verify that the missing values have been treated.
display(X_train.isna().sum())
print("-" * 30)
display(X_val.isna().sum())
print("-" * 30)
display(X_test.isna().sum())
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
------------------------------
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
------------------------------
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
Observations
# Check the unique values of the categorical variables in train set
cols = X_train.select_dtypes(include=["object"])
for i in cols.columns:
    print(X_train[i].value_counts())
    print("~" * 35)
F 3193 M 2882 Name: Gender, dtype: int64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Graduate 2782 High School 1228 Uneducated 881 College 618 Post-Graduate 312 Doctorate 254 Name: Education_Level, dtype: int64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Married 3276 Single 2369 Divorced 430 Name: Marital_Status, dtype: int64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Less than $40K 2783 $40K - $60K 1059 $80K - $120K 953 $60K - $80K 831 $120K + 449 Name: Income_Category, dtype: int64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Blue 5655 Silver 339 Gold 69 Platinum 12 Name: Card_Category, dtype: int64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Check the unique values of the categorical variables in validation set
cols = X_val.select_dtypes(include=["object"])
for i in cols.columns:
    print(X_val[i].value_counts())
    print("~" * 35)
F 1095 M 931 Name: Gender, dtype: int64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Graduate 917 High School 404 Uneducated 306 College 199 Post-Graduate 101 Doctorate 99 Name: Education_Level, dtype: int64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Married 1100 Single 770 Divorced 156 Name: Marital_Status, dtype: int64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Less than $40K 957 $40K - $60K 361 $80K - $120K 293 $60K - $80K 279 $120K + 136 Name: Income_Category, dtype: int64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Blue 1905 Silver 97 Gold 21 Platinum 3 Name: Card_Category, dtype: int64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Check the unique values of the categorical variables in test set
cols = X_test.select_dtypes(include=["object"])
for i in cols.columns:
    print(X_test[i].value_counts())
    print("~" * 35)
F 1070 M 956 Name: Gender, dtype: int64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Graduate 948 High School 381 Uneducated 300 College 196 Post-Graduate 103 Doctorate 98 Name: Education_Level, dtype: int64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Married 1060 Single 804 Divorced 162 Name: Marital_Status, dtype: int64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Less than $40K 933 $40K - $60K 370 $60K - $80K 292 $80K - $120K 289 $120K + 142 Name: Income_Category, dtype: int64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Blue 1876 Silver 119 Gold 26 Platinum 5 Name: Card_Category, dtype: int64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
### Encoding categorical variables
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 29) (2026, 29) (2026, 29)
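One caveat worth noting: calling `pd.get_dummies` separately on each split only produces matching columns because every category happens to appear in all three splits (verified by the value counts above). A defensive sketch with hypothetical toy frames, re-aligning the encoded test columns to the training columns:

```python
import pandas as pd

# Hypothetical toy splits where the test set is missing two categories
train = pd.DataFrame({"Card_Category": ["Blue", "Silver", "Gold"]})
test = pd.DataFrame({"Card_Category": ["Blue", "Blue"]})  # no Silver/Gold

train_enc = pd.get_dummies(train, drop_first=True)
# Without alignment, the encoded test frame would have different columns;
# reindex against the training columns and fill absent dummies with 0
test_enc = pd.get_dummies(test, drop_first=True).reindex(
    columns=train_enc.columns, fill_value=0
)
print(list(test_enc.columns) == list(train_enc.columns))  # True
```

An alternative with the same effect is to fit a `sklearn.preprocessing.OneHotEncoder` on the training set and transform the other splits with it.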
Observations
The model can make wrong predictions in two ways:
1. Predicting a customer will attrite when they will not (a False Positive).
2. Predicting a customer will not attrite when they will (a False Negative).

Which case is more important? A False Negative: the bank loses a customer without getting the chance to act, so failing to flag an attriting customer is the costlier error.

How do we reduce this loss, i.e. reduce False Negatives? By choosing and tuning models to maximize Recall.
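A toy illustration (hypothetical labels) of why Recall, not Accuracy, is the right yardstick here: on imbalanced data, a model that predicts "no attrition" for everyone looks highly accurate yet catches none of the attriting customers.

```python
from sklearn.metrics import accuracy_score, recall_score

# 9 existing customers (0), 1 attrited customer (1)
y_true = [0] * 9 + [1]
# A useless model that always predicts "existing customer"
y_pred = [0] * 10

print(accuracy_score(y_true, y_pred))  # 0.9 -> looks good
print(recall_score(y_true, y_pred))    # 0.0 -> every attriting customer missed
```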
######
# User defined function to compute different metrics to check performance of a
# classification model built using sklearn
######
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )

    return df_perf
# defining a function to plot the confusion_matrix of a classification model built using sklearn
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# -----Checking Class Balance in the Dataset
labeled_barplot(eda_df, "Attrition_Flag", True, True)
Observations
Attrition_Flag has a high class imbalance.

We have observed previously that the dependent variable Attrition_Flag has a high class imbalance. Hence we will try out four different class balancing strategies. We define a function below to modularize the code.
#We will use SMOTE and RandomOverSampler for OVERSAMPLING
# We will use RandomUnderSampler and NearMiss for UNDERSAMPLING
def Balance_Data(choice):
    if choice == 1:
        sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
    elif choice == 2:
        sm = RandomUnderSampler(random_state=1)
    elif choice == 3:
        sm = NearMiss(version=1)
    elif choice == 4:
        sm = RandomOverSampler(random_state=1)
    elif choice == 0:
        return X_train, y_train

    # Apply the class balancing technique chosen above
    X_train_balanced, y_train_balanced = sm.fit_resample(X_train, y_train)

    print("Before Class Balancing, counts of label 'Attrited': {}".format(sum(y_train == 1)))
    print("Before Class Balancing, counts of label 'Not Attrited': {} \n".format(sum(y_train == 0)))
    print("After Class Balancing, counts of label 'Yes': {}".format(sum(y_train_balanced == 1)))
    print("After Class Balancing, counts of label 'No': {} \n".format(sum(y_train_balanced == 0)))
    print("After Class Balancing, the shape of train_X: {}".format(X_train_balanced.shape))
    print("After Class Balancing, the shape of train_y: {} \n".format(y_train_balanced.shape))

    return X_train_balanced, y_train_balanced
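The random-oversampling idea used above (choice 4) can be sketched with plain `sklearn.utils.resample`, no imblearn required; the toy arrays here are hypothetical. The minority class is resampled with replacement until it matches the majority count, which is essentially what `RandomOverSampler` does.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical toy data: 9 majority samples (0), 3 minority samples (1)
X = np.arange(12).reshape(-1, 1)
y = np.array([0] * 9 + [1] * 3)

X_min, X_maj = X[y == 1], X[y == 0]

# Draw minority rows with replacement until the classes are the same size
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=1)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(np.bincount(y_bal))  # [9 9]
```

SMOTE differs in that it synthesizes new minority points by interpolating between nearest neighbours rather than duplicating existing rows.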
This function creates models with default settings based on the Class Balancing of CHOICE. The models are: Logistic Regression, Decision Tree, Bagging, Random Forest, GradientBoost & XGBoost.
def Model_Creation(CHOICE):
    """
    This function creates a model with default setting based on Class Balancing of CHOICE.
    The models are: Logistic regression, dtree, Bagging, Random forest, GradientBoost & Xgboost

    CHOICE = CHOICE of Class Balancing
    """
    models = []  # Empty list to store all the models

    # --------------------------Appending models into the list--------------------------
    models.append(("Logistic Regression", LogisticRegression(random_state=1)))
    models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
    models.append(("Bagging", BaggingClassifier(random_state=1)))
    models.append(("Random Forest", RandomForestClassifier(random_state=1)))
    models.append(("GBM", GradientBoostingClassifier(random_state=1)))
    models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
    # -----------------------------------------------------------------------------------

    scores = []  # Empty list to store all models' Recall scores
    scores_val = []  # Empty list to store all models' Recall scores for Validation data
    names = []  # Empty list to store the names of the models
    cv_results = []  # Empty list to store all models' CV scores

    # CHOICE = 0 returns X_train & y_train... 1 returns SMOTE balanced... etc.
    X_bal, y_bal = Balance_Data(CHOICE)

    # --------------------loop through all models to get the Recall score for train & val--------------------
    print("\n" "Training Performance (RECALL):" "\n")
    for name, model in models:
        model.fit(X_bal, y_bal)
        score = recall_score(y_bal, model.predict(X_bal))
        print("{}: {}".format(name, score))
        scores.append(score)
        names.append(name)

    print("\n" "Validation Performance (RECALL):" "\n")
    for name, model in models:
        model.fit(X_bal, y_bal)
        score_val = recall_score(y_val, model.predict(X_val))
        print("{}: {}".format(name, score_val))
        scores_val.append(score_val)

    # -----------loop through all models to get the mean of the cross-validation results----------------------
    print("\n" "Cross-Validation Performance (Mean):" "\n")
    for name, model in models:
        scoring = "recall"
        kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
        cv_result = cross_val_score(
            estimator=model, X=X_bal, y=y_bal, scoring=scoring, cv=kfold
        )
        cv_results.append(cv_result)
        print("{}: {}".format(name, cv_result.mean() * 100))

    return scores, scores_val, cv_results, names
scores_IB, scores_val_IB, cv_results_IB, names = Model_Creation(0)
Training Performance (RECALL): Logistic Regression: 0.4989754098360656 Decision Tree: 1.0 Bagging: 0.985655737704918 Random Forest: 1.0 GBM: 0.875 Xgboost: 1.0 Validation Performance (RECALL): Logistic Regression: 0.5858895705521472 Decision Tree: 0.8159509202453987 Bagging: 0.8098159509202454 Random Forest: 0.803680981595092 GBM: 0.8558282208588958 Xgboost: 0.8834355828220859 Cross-Validation Performance (Mean): Logistic Regression: 48.66509680795395 Decision Tree: 77.76556776556777 Bagging: 78.48194662480377 Random Forest: 75.71480900052329 GBM: 81.24646781789639 Xgboost: 86.57613814756672
scores_SMOTE, scores_val_SMOTE, cv_results_SMOTE, names = Model_Creation(1)
Before Class Balancing, counts of label 'Attrited': 976 Before Class Balancing, counts of label 'Not Attrited': 5099 After Class Balancing, counts of label 'Yes': 5099 After Class Balancing, counts of label 'No': 5099 After Class Balancing, the shape of train_X: (10198, 29) After Class Balancing, the shape of train_y: (10198,) Training Performance (RECALL): Logistic Regression: 0.8633065306922926 Decision Tree: 1.0 Bagging: 0.9962737791723868 Random Forest: 1.0 GBM: 0.9760737399490096 Xgboost: 1.0 Validation Performance (RECALL): Logistic Regression: 0.8159509202453987 Decision Tree: 0.843558282208589 Bagging: 0.8374233128834356 Random Forest: 0.8558282208588958 GBM: 0.8926380368098159 Xgboost: 0.8926380368098159 Cross-Validation Performance (Mean): Logistic Regression: 85.97812157247591 Decision Tree: 93.68488906848313 Bagging: 94.86185995497316 Random Forest: 97.17598953222112 GBM: 96.92097211799341 Xgboost: 98.05849641132212
scores_RUS, scores_val_RUS, cv_results_RUS, names = Model_Creation(2)
Before Class Balancing, counts of label 'Attrited': 976 Before Class Balancing, counts of label 'Not Attrited': 5099 After Class Balancing, counts of label 'Yes': 976 After Class Balancing, counts of label 'No': 976 After Class Balancing, the shape of train_X: (1952, 29) After Class Balancing, the shape of train_y: (1952,) Training Performance (RECALL): Logistic Regression: 0.8258196721311475 Decision Tree: 1.0 Bagging: 0.9907786885245902 Random Forest: 1.0 GBM: 0.9805327868852459 Xgboost: 1.0 Validation Performance (RECALL): Logistic Regression: 0.8466257668711656 Decision Tree: 0.9202453987730062 Bagging: 0.9325153374233128 Random Forest: 0.9386503067484663 GBM: 0.9570552147239264 Xgboost: 0.9570552147239264 Cross-Validation Performance (Mean): Logistic Regression: 81.6624803767661 Decision Tree: 88.52433281004708 Bagging: 90.7796964939822 Random Forest: 93.34013605442178 GBM: 94.26216640502355 Xgboost: 95.28833071690215
scores_NM, scores_val_NM, cv_results_NM, names = Model_Creation(3)
Before Class Balancing, counts of label 'Attrited': 976 Before Class Balancing, counts of label 'Not Attrited': 5099 After Class Balancing, counts of label 'Yes': 976 After Class Balancing, counts of label 'No': 976 After Class Balancing, the shape of train_X: (1952, 29) After Class Balancing, the shape of train_y: (1952,) Training Performance (RECALL): Logistic Regression: 0.8391393442622951 Decision Tree: 1.0 Bagging: 0.992827868852459 Random Forest: 1.0 GBM: 0.9877049180327869 Xgboost: 1.0 Validation Performance (RECALL): Logistic Regression: 0.8926380368098159 Decision Tree: 0.8957055214723927 Bagging: 0.911042944785276 Random Forest: 0.9601226993865031 GBM: 0.9631901840490797 Xgboost: 0.9693251533742331 Cross-Validation Performance (Mean): Logistic Regression: 82.27472527472527 Decision Tree: 88.11355311355312 Bagging: 90.98430141287285 Random Forest: 95.49031920460493 GBM: 94.26111983254842 Xgboost: 95.18367346938777
scores_ROS, scores_val_ROS, cv_results_ROS, names = Model_Creation(4)
Before Class Balancing, counts of label 'Attrited': 976 Before Class Balancing, counts of label 'Not Attrited': 5099 After Class Balancing, counts of label 'Yes': 5099 After Class Balancing, counts of label 'No': 5099 After Class Balancing, the shape of train_X: (10198, 29) After Class Balancing, the shape of train_y: (10198,) Training Performance (RECALL): Logistic Regression: 0.8458521278682094 Decision Tree: 1.0 Bagging: 1.0 Random Forest: 1.0 GBM: 0.9811727789762699 Xgboost: 1.0 Validation Performance (RECALL): Logistic Regression: 0.8404907975460123 Decision Tree: 0.7699386503067485 Bagging: 0.8128834355828221 Random Forest: 0.843558282208589 GBM: 0.9386503067484663 Xgboost: 0.9355828220858896 Cross-Validation Performance (Mean): Logistic Regression: 84.17337258750409 Decision Tree: 99.80392156862746 Bagging: 99.70588235294117 Random Forest: 99.78429448324964 GBM: 97.74455925647982 Xgboost: 99.9607843137255
pd.DataFrame(cv_results_ROS).mean(axis=1)
0 0.842 1 0.998 2 0.997 3 0.998 4 0.977 5 1.000 dtype: float64
# Training performance comparison
models_train_comp_df = pd.concat(
    [
        pd.DataFrame(scores_IB),
        pd.DataFrame(scores_SMOTE),
        pd.DataFrame(scores_RUS),
        pd.DataFrame(scores_NM),
        pd.DataFrame(scores_ROS),
    ],
    axis=1,
)
models_train_comp_df.index = [names]
models_train_comp_df.columns = ["Original", "SMOTE", "RUS", "NearMiss", "ROS"]
# Print out the Training performance comparison matrix
print("Training performance comparison:")
models_train_comp_df.T
Training performance comparison:
| Logistic Regression | Decision Tree | Bagging | Random Forest | GBM | Xgboost | |
|---|---|---|---|---|---|---|
| Original | 0.499 | 1.000 | 0.986 | 1.000 | 0.875 | 1.000 |
| SMOTE | 0.863 | 1.000 | 0.996 | 1.000 | 0.976 | 1.000 |
| RUS | 0.826 | 1.000 | 0.991 | 1.000 | 0.981 | 1.000 |
| NearMiss | 0.839 | 1.000 | 0.993 | 1.000 | 0.988 | 1.000 |
| ROS | 0.846 | 1.000 | 1.000 | 1.000 | 0.981 | 1.000 |
# Cross validation performance comparison (Mean)
models_cv_comp_df = pd.concat(
    [
        pd.DataFrame(cv_results_IB).mean(axis=1),  # Mean of Cross Validation results across rows
        pd.DataFrame(cv_results_SMOTE).mean(axis=1),
        pd.DataFrame(cv_results_RUS).mean(axis=1),
        pd.DataFrame(cv_results_NM).mean(axis=1),
        pd.DataFrame(cv_results_ROS).mean(axis=1),
    ],
    axis=1,
)
models_cv_comp_df.index = [names]
models_cv_comp_df.columns = ["Original", "SMOTE", "RUS", "NearMiss", "ROS"]
# Print out the Cross Validation performance comparison matrix
print("Cross Validation performance (Mean) comparison:")
models_cv_comp_df.T
Cross Validation performance (Mean) comparison:
| Logistic Regression | Decision Tree | Bagging | Random Forest | GBM | Xgboost | |
|---|---|---|---|---|---|---|
| Original | 0.487 | 0.778 | 0.785 | 0.757 | 0.812 | 0.866 |
| SMOTE | 0.860 | 0.937 | 0.949 | 0.972 | 0.969 | 0.981 |
| RUS | 0.817 | 0.885 | 0.908 | 0.933 | 0.943 | 0.953 |
| NearMiss | 0.823 | 0.881 | 0.910 | 0.955 | 0.943 | 0.952 |
| ROS | 0.842 | 0.998 | 0.997 | 0.998 | 0.977 | 1.000 |
Observations
◎ From the comparison above, it is clear that the Recall scores on the Balanced (oversampled & undersampled) data are better than on the Original (imbalanced) data for all 6 of our models.
◎ So we treat the models trained on the Original (imbalanced) data as the baseline, knowing that the models trained on the corresponding Balanced data will give the best scores.
◎ This helps us narrow down from 6 algorithms (30 models) to the best 4 algorithms (16 models, excluding the models trained on the Original data).
◎ Further analysis will reduce these 16 models to 6 models of interest.
# Plotting boxplots for CV scores of all models defined above for the Original (Imbalanced) Data
fig = plt.figure()
fig.suptitle("Algorithm Comparison: Original (Imbalance) Data")
ax = fig.add_subplot(111)
plt.boxplot(cv_results_IB)
ax.set_xticklabels(names, rotation="vertical")
plt.show()
Observations
Each of the 4 best algorithms has 4 models based on our 4 Data Balancing Strategies (SMOTE, RandomUnderSampler, NearMiss & RandomOverSampler), making 16 models in total.
We will carry out further analysis to arrive at 6 models.
#---------------------
# Print the Cross Validation Statistics of the 16 models: mean, max and STD
# These statistics tell the story of Bias & Variance and will let us sift the models down to 6
#---------------------
# Indices of the 4 best algorithms within the cv_results lists
algo_idx = {"Bagging": 2, "Random Forest": 3, "GradientBoost": 4, "XGBoost": 5}
balancing = {
    "SMOTE": cv_results_SMOTE,
    "RUS": cv_results_RUS,
    "NearMiss": cv_results_NM,
    "ROS": cv_results_ROS,
}

# Loop over the 4 best algorithms and the 4 balancing strategies
for algo, idx in algo_idx.items():
    print("------------------------{} Cross Validation Statistics--------------------------".format(algo))
    for i, (strategy, results) in enumerate(balancing.items()):
        print('For {}'.format(strategy))
        print('Mean CV Score:', results[idx].mean())
        print('Max CV Score:', results[idx].max())
        print('CV STD:', results[idx].std())
        print('\n')
        if i < len(balancing) - 1:
            print("~" * 35)
------------------------Bagging Cross Validation Statistics-------------------------- For SMOTE Mean CV Score: 0.9486185995497316 Max CV Score: 0.9548577036310107 CV STD: 0.00421964722419514 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For RUS Mean CV Score: 0.907796964939822 Max CV Score: 0.9384615384615385 CV STD: 0.0167904933180432 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For NearMiss Mean CV Score: 0.9098430141287285 Max CV Score: 0.9333333333333333 CV STD: 0.012336736910721664 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For ROS Mean CV Score: 0.9970588235294118 Max CV Score: 1.0 CV STD: 0.003100272215851341 ------------------------Random Forest Cross Validation Statistics-------------------------- For SMOTE Mean CV Score: 0.9717598953222113 Max CV Score: 0.9754901960784313 CV STD: 0.004652764434332977 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For RUS Mean CV Score: 0.9334013605442177 Max CV Score: 0.9641025641025641 CV STD: 0.0197290681683787 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For NearMiss Mean CV Score: 0.9549031920460493 Max CV Score: 0.9693877551020408 CV STD: 0.010992120364461887 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For ROS Mean CV Score: 0.9978429448324964 Max CV Score: 0.9990196078431373 CV STD: 0.0011431259475611552 ------------------------GradientBoost Cross Validation Statistics-------------------------- For SMOTE Mean CV Score: 0.9692097211799341 Max CV Score: 0.9715686274509804 CV STD: 0.0020176271846683437 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For RUS Mean CV Score: 0.9426216640502355 Max CV Score: 0.9538461538461539 CV STD: 0.010465803244080312 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For NearMiss Mean CV Score: 0.9426111983254841 Max CV Score: 0.9540816326530612 CV STD: 0.008883415425976452 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For ROS Mean CV Score: 0.9774455925647981 Max CV Score: 0.9803921568627451 CV STD: 0.002712530311165646 ------------------------XGBoost Cross Validation Statistics-------------------------- For SMOTE Mean CV Score: 0.9805849641132213 Max CV Score: 
0.9833333333333333 CV STD: 0.00368676461417039 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For RUS Mean CV Score: 0.9528833071690215 Max CV Score: 0.9692307692307692 CV STD: 0.010373724357350156 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For NearMiss Mean CV Score: 0.9518367346938776 Max CV Score: 0.9591836734693877 CV STD: 0.006999809706174566 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For ROS Mean CV Score: 0.9996078431372549 Max CV Score: 1.0 CV STD: 0.0007843137254901933
Observations
Aggregating the CV scores for the 6 models we are interested in...
# Models that Qualified so far
q_models = [
    "Bagging_ROS",
    "RandomForest_ROS",
    "GradientBoost_ROS",
    "XGBoost_ROS",
    "RandomForest_SMOTE",
    "XGBoost_SMOTE",
]
# Cross Validation Score (CVS) labels
cvs_labels = ["cvs_1", "cvs_2", "cvs_3", "cvs_4", "cvs_5"]
# Define an empty list
cv_scores = []
# Append the cross validation scores
cv_scores.append(cv_results_ROS[2])
cv_scores.append(cv_results_ROS[3])
cv_scores.append(cv_results_ROS[4])
cv_scores.append(cv_results_ROS[5])
cv_scores.append(cv_results_SMOTE[3])
cv_scores.append(cv_results_SMOTE[5])
# Convert to Dataframe
cv_scores_df = pd.DataFrame(cv_scores, index=q_models, columns=cvs_labels)
cv_scores_df = cv_scores_df.T
cv_scores_df
| Bagging_ROS | RandomForest_ROS | GradientBoost_ROS | XGBoost_ROS | RandomForest_SMOTE | XGBoost_SMOTE | |
|---|---|---|---|---|---|---|
| cvs_1 | 0.991 | 0.996 | 0.980 | 1.000 | 0.968 | 0.974 |
| cvs_2 | 0.999 | 0.998 | 0.977 | 1.000 | 0.975 | 0.983 |
| cvs_3 | 0.998 | 0.997 | 0.979 | 0.998 | 0.975 | 0.982 |
| cvs_4 | 1.000 | 0.999 | 0.973 | 1.000 | 0.975 | 0.983 |
| cvs_5 | 0.997 | 0.999 | 0.977 | 1.000 | 0.965 | 0.980 |
# Create the boxplot
plt.boxplot(cv_scores_df)
# Customize the x-tick labels
plt.xticks(
    [1, 2, 3, 4, 5, 6],
    [
        "Bagging_ROS",
        "RandomForest_ROS",
        "GradientBoost_ROS",
        "XGBoost_ROS",
        "RandomForest_SMOTE",
        "XGBoost_SMOTE",
    ],
    rotation="vertical",
)
plt.show()
Observations
Bagging_ROS, RandomForest_ROS & XGBoost_ROS are showing a lot of promise.

Calculating the Confidence Interval for the models we are interested in...
# Calculate the confidence interval for each model's scores
# A 95% confidence interval lies between the 2.5th and 97.5th percentiles
lower_percentile = 2.5  # Lower percentile for the confidence interval
upper_percentile = 97.5  # Upper percentile for the confidence interval

confidence_intervals_df = cv_scores_df.quantile(
    [lower_percentile / 100, upper_percentile / 100]
)
# Transpose the DataFrame for a better display
confidence_intervals_df = confidence_intervals_df.T
confidence_intervals_df.columns = ["Lower CI", "Upper CI"]
confidence_intervals_df
| Lower CI | Upper CI | |
|---|---|---|
| Bagging_ROS | 0.992 | 1.000 |
| RandomForest_ROS | 0.996 | 0.999 |
| GradientBoost_ROS | 0.973 | 0.980 |
| XGBoost_ROS | 0.998 | 1.000 |
| RandomForest_SMOTE | 0.965 | 0.975 |
| XGBoost_SMOTE | 0.974 | 0.983 |
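The percentile band above is `DataFrame.quantile` applied column-wise; a minimal sketch with made-up recall scores (not the notebook's actual values):

```python
import pandas as pd

# Five hypothetical CV recall scores for one model
scores = pd.DataFrame({"model_A": [0.97, 0.98, 0.99, 0.98, 1.00]})

# The 2.5th and 97.5th percentiles bound an empirical 95% interval
ci = scores.quantile([0.025, 0.975]).T
ci.columns = ["Lower CI", "Upper CI"]
print(ci)
```

With only five scores per model, the interval is a rough empirical band rather than a formal confidence interval.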
Getting the Recall scores of the models on the Validation set...
Then we compare the scores of the 6 models of interest to see whether they lie within the confidence intervals.
# Validation performance comparison
models_val_comp_df = pd.concat(
[
pd.DataFrame(scores_val_IB),
pd.DataFrame(scores_val_SMOTE),
pd.DataFrame(scores_val_RUS),
pd.DataFrame(scores_val_NM),
pd.DataFrame(scores_val_ROS),
],
axis=1,
)
models_val_comp_df.index = [names]
models_val_comp_df.columns = ["Original", "SMOTE", "RUS", "NearMiss", "ROS"]
# Print out the Validation performance comparison matrix
print("Validation Performance Comparison:")
models_val_comp_df.T
Validation Performance Comparison:
| | Logistic Regression | Decision Tree | Bagging | Random Forest | GBM | Xgboost |
|---|---|---|---|---|---|---|
| Original | 0.586 | 0.816 | 0.810 | 0.804 | 0.856 | 0.883 |
| SMOTE | 0.816 | 0.844 | 0.837 | 0.856 | 0.893 | 0.893 |
| RUS | 0.847 | 0.920 | 0.933 | 0.939 | 0.957 | 0.957 |
| NearMiss | 0.893 | 0.896 | 0.911 | 0.960 | 0.963 | 0.969 |
| ROS | 0.840 | 0.770 | 0.813 | 0.844 | 0.939 | 0.936 |
#######
# We will add the validation performance scores beside the Confidence Intervals
# so that it will be easy for us to compare the results.
#######
val_perf = []  # create an empty list
# Append the validation performance scores
val_perf.append(scores_val_ROS[2])
val_perf.append(scores_val_ROS[3])
val_perf.append(scores_val_ROS[4])
val_perf.append(scores_val_ROS[5])
val_perf.append(scores_val_SMOTE[3])
val_perf.append(scores_val_SMOTE[5])
# Add the list to the CI dataframe
confidence_intervals_df["Val_Perf"] = val_perf
confidence_intervals_df
| | Lower CI | Upper CI | Val_Perf |
|---|---|---|---|
| Bagging_ROS | 0.992 | 1.000 | 0.813 |
| RandomForest_ROS | 0.996 | 0.999 | 0.844 |
| GradientBoost_ROS | 0.973 | 0.980 | 0.939 |
| XGBoost_ROS | 0.998 | 1.000 | 0.936 |
| RandomForest_SMOTE | 0.965 | 0.975 | 0.856 |
| XGBoost_SMOTE | 0.974 | 0.983 | 0.893 |
Observations
######
## Tune Bagging with ROS using RandomizedSearchCV
######
# Define the base estimator
base_estimator = DecisionTreeClassifier(random_state=1)
# Define the Bagging classifier
bgc = BaggingClassifier(base_estimator=base_estimator, random_state=1)
# Define the hyperparameter grid
param_grid = {
'n_estimators': np.arange(10, 100, 10),
'max_samples': np.arange(0.1, 1.1, 0.1),
'max_features': np.arange(0.1, 1.1, 0.1),
'bootstrap': [True, False],
'base_estimator': [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
DecisionTreeClassifier(max_depth=4, random_state=1),
DecisionTreeClassifier(max_depth=5, random_state=1),
],
}
# Perform random search cross-validation
bag_ROS_tuned = RandomizedSearchCV(bgc, param_distributions=param_grid, n_iter=50, cv=5, scoring='recall', random_state=1, n_jobs = -1)
X_train_ROS, y_train_ROS = Balance_Data(4) #Generate Oversampled data using ROS - option 4
bag_ROS_tuned.fit(X_train_ROS, y_train_ROS)
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099
After Class Balancing, counts of label 'Yes': 5099
After Class Balancing, counts of label 'No': 5099
After Class Balancing, the shape of train_X: (10198, 29)
After Class Balancing, the shape of train_y: (10198,)
RandomizedSearchCV(cv=5,
estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1),
random_state=1),
n_iter=50, n_jobs=-1,
param_distributions={'base_estimator': [DecisionTreeClassifier(max_depth=1,
random_state=1),
DecisionTreeClassifier(max_depth=2,
random_state=1),
DecisionTreeClassifier(max_depth=3,
random_state=1),
DecisionTreeClassifier(max_depth=4,
random_state=1),
DecisionTreeClassifier(max_depth=5,
random_state=1)],
'bootstrap': [True, False],
'max_features': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
'max_samples': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90])},
random_state=1, scoring='recall')
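After `fit`, the winning combination is available via `best_params_` and `best_score_`; a self-contained sketch on synthetic data (the notebook's `bgc` grid is much larger, this only illustrates the mechanics):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic stand-in for the oversampled training set
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

search = RandomizedSearchCV(
    BaggingClassifier(random_state=1),
    param_distributions={"n_estimators": np.arange(10, 60, 10)},
    n_iter=5,
    cv=3,
    scoring="recall",
    random_state=1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```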
# checking the model performance on Validation set
bag_ROS_model_val_perf = model_performance_classification_sklearn(
bag_ROS_tuned, X_val, y_val
)
print("bag_ROS_tuned Validation performance:")
bag_ROS_model_val_perf
bag_ROS_tuned Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.928 | 0.945 | 0.708 | 0.809 |
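`model_performance_classification_sklearn` is a helper defined earlier in the notebook; presumably it computes something like the following (a sketch with toy labels, not the actual helper):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy ground truth and predictions (1 = attrited)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

perf = pd.DataFrame(
    {
        "Accuracy": [accuracy_score(y_true, y_pred)],
        "Recall": [recall_score(y_true, y_pred)],
        "Precision": [precision_score(y_true, y_pred)],
        "F1": [f1_score(y_true, y_pred)],
    }
)
print(perf)
```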
######
## Tune RandomForest with ROS using RandomizedSearchCV
######
# Define the Random Forest classifier
rf = RandomForestClassifier(random_state=1)
# Define the hyperparameter grid
param_grid = {
'n_estimators': np.arange(100, 1000, 100),
'max_depth': np.arange(2, 20),
'min_samples_leaf': np.arange(1, 10),
'max_features': np.arange(0.2, 0.8, 0.1),
'criterion': ['gini', 'entropy'],
'bootstrap': [True, False],
'class_weight': ['balanced', 'balanced_subsample'],
'min_impurity_decrease':[0.001, 0.002, 0.003]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Perform random search cross-validation
rf_ROS_tuned = RandomizedSearchCV(rf, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
X_train_ROS, y_train_ROS = Balance_Data(4) #Generate Oversampled data using ROS - option 4
rf_ROS_tuned.fit(X_train_ROS, y_train_ROS)
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099
After Class Balancing, counts of label 'Yes': 5099
After Class Balancing, counts of label 'No': 5099
After Class Balancing, the shape of train_X: (10198, 29)
After Class Balancing, the shape of train_y: (10198,)
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=1),
n_iter=50, n_jobs=-1,
param_distributions={'bootstrap': [True, False],
'class_weight': ['balanced',
'balanced_subsample'],
'criterion': ['gini', 'entropy'],
'max_depth': array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19]),
'max_features': array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]),
'min_impurity_decrease': [0.001, 0.002,
0.003],
'min_samples_leaf': array([1, 2, 3, 4, 5, 6, 7, 8, 9]),
'n_estimators': array([100, 200, 300, 400, 500, 600, 700, 800, 900])},
random_state=1, scoring=make_scorer(recall_score))
# checking the model performance for Validation set
rf_ROS_model_val_perf = model_performance_classification_sklearn(
rf_ROS_tuned, X_val, y_val
)
print("rf_ROS_tuned Validation performance:")
rf_ROS_model_val_perf
rf_ROS_tuned Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.950 | 0.911 | 0.803 | 0.853 |
######
## Tune GradientBoost with ROS using RandomizedSearchCV
######
# Define the Gradient Boosting classifier
gb = GradientBoostingClassifier()
# Define the hyperparameter grid
param_grid = {
'n_estimators': np.arange(100, 1000, 100),
'learning_rate': [0.1, 0.01, 0.001],
'max_depth': np.arange(2, 10),
'min_samples_split': np.arange(2, 10),
'min_samples_leaf': np.arange(1, 10),
'max_features': ['sqrt', 'log2'],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Perform random search cross-validation
gb_ROS_tuned = RandomizedSearchCV(gb, param_distributions=param_grid, n_iter=50, cv=5, scoring=scorer, random_state=1, n_jobs = -1)
#Generate Oversampled data using RandomOversampler - option 4
X_train_ROS, y_train_ROS = Balance_Data(4)
#Fitting the model oversampled (ROS) train set
gb_ROS_tuned.fit(X_train_ROS, y_train_ROS)
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099
After Class Balancing, counts of label 'Yes': 5099
After Class Balancing, counts of label 'No': 5099
After Class Balancing, the shape of train_X: (10198, 29)
After Class Balancing, the shape of train_y: (10198,)
RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(), n_iter=50,
n_jobs=-1,
param_distributions={'learning_rate': [0.1, 0.01, 0.001],
'max_depth': array([2, 3, 4, 5, 6, 7, 8, 9]),
'max_features': ['sqrt', 'log2'],
'min_samples_leaf': array([1, 2, 3, 4, 5, 6, 7, 8, 9]),
'min_samples_split': array([2, 3, 4, 5, 6, 7, 8, 9]),
'n_estimators': array([100, 200, 300, 400, 500, 600, 700, 800, 900])},
random_state=1, scoring=make_scorer(recall_score))
# checking the model performance for Validation set
gb_ROS_model_val_perf = model_performance_classification_sklearn(
gb_ROS_tuned, X_val, y_val
)
print("gb_ROS_tuned Validation performance:")
gb_ROS_model_val_perf
gb_ROS_tuned Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.973 | 0.899 | 0.933 | 0.916 |
######
## Tune XGBoost with ROS using RandomizedSearchCV
######
# defining model
model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,200,50),
'scale_pos_weight':[1,2,5,10,12,15],
'learning_rate':[0.01,0.1,0.2,0.05],
'gamma':[0,1,3,5],
'subsample':[0.7,0.8,0.9,1],
'max_depth':np.arange(1,7,1),
'reg_lambda':[5,10,12,15]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
XGBoost_ROS_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Generate Oversampled data using RandomOversampler - option 4
X_train_ROS, y_train_ROS = Balance_Data(4)
#Fitting parameters in RandomizedSearchCV
XGBoost_ROS_tuned.fit(X_train_ROS, y_train_ROS)
# Access the best estimator from the RandomizedSearchCV
XGBoost_ROS_tuned.best_params_
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099
After Class Balancing, counts of label 'Yes': 5099
After Class Balancing, counts of label 'No': 5099
After Class Balancing, the shape of train_X: (10198, 29)
After Class Balancing, the shape of train_y: (10198,)
{'subsample': 1,
'scale_pos_weight': 12,
'reg_lambda': 10,
'n_estimators': 150,
'max_depth': 1,
'learning_rate': 0.01,
'gamma': 0}
# Initiate XGBoost with the best parameters of RandomizedSearchCV
XGBoost_ROS_tuned = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=1,
scale_pos_weight=12,
reg_lambda=10,
n_estimators=150,
max_depth=1,
learning_rate=0.01,
gamma=0,
)
# Fitting
XGBoost_ROS_tuned.fit(X_train_ROS, y_train_ROS)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=0, gpu_id=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=0.01, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=1,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=150, n_jobs=None,
num_parallel_tree=None, predictor=None, random_state=1, ...)
# checking the model performance for Validation set
XGBoost_ROS_model_val_perf = model_performance_classification_sklearn(
XGBoost_ROS_tuned, X_val, y_val
)
print("XGBoost_ROS Validation performance:")
XGBoost_ROS_model_val_perf
XGBoost_ROS Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.161 | 1.000 | 0.161 | 0.277 |
######
## Tune RandomForest with SMOTE using RandomizedSearchCV
######
# Choose the type of classifier.
rf2 = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {"n_estimators": [150,200,250],
"min_samples_leaf": np.arange(5, 10),
"max_features": np.arange(0.2, 0.7, 0.1),
"max_samples": np.arange(0.3, 0.7, 0.1),
"max_depth":np.arange(3,4,5),  # note: np.arange(3, 4, 5) evaluates to array([3]) only
"class_weight" : ['balanced', 'balanced_subsample'],
"min_impurity_decrease":[0.001, 0.002, 0.003]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the random search
rf_SMOTE_tuned = RandomizedSearchCV(rf2, parameters,n_iter=30, scoring=scorer,cv=5, random_state = 1, n_jobs = -1)
X_train_SMOTE, y_train_SMOTE = Balance_Data(1) #Generate Oversampled data using SMOTE - option 1
rf_SMOTE_tuned.fit(X_train_SMOTE, y_train_SMOTE)
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099
After Class Balancing, counts of label 'Yes': 5099
After Class Balancing, counts of label 'No': 5099
After Class Balancing, the shape of train_X: (10198, 29)
After Class Balancing, the shape of train_y: (10198,)
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=1),
n_iter=30, n_jobs=-1,
param_distributions={'class_weight': ['balanced',
'balanced_subsample'],
'max_depth': array([3]),
'max_features': array([0.2, 0.3, 0.4, 0.5, 0.6]),
'max_samples': array([0.3, 0.4, 0.5, 0.6]),
'min_impurity_decrease': [0.001, 0.002,
0.003],
'min_samples_leaf': array([5, 6, 7, 8, 9]),
'n_estimators': [150, 200, 250]},
random_state=1, scoring=make_scorer(recall_score))
# checking the model performance for Validation set
rf_SMOTE_model_val_perf = model_performance_classification_sklearn(
rf_SMOTE_tuned, X_val, y_val
)
print("rf_SMOTE_tuned Validation performance:")
rf_SMOTE_model_val_perf
rf_SMOTE_tuned Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.870 | 0.902 | 0.560 | 0.691 |
######
## Tune XGBoost with SMOTE using RandomizedSearchCV
######
# defining model
model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,200,50),
'scale_pos_weight':[1,2,5,10,12,15],
'learning_rate':[0.01,0.1,0.2,0.05],
'gamma':[0,1,3,5],
'subsample':[0.7,0.8,0.9,1],
'max_depth':np.arange(1,7,1),
'reg_lambda':[5,10,12,15]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
XGBoost_SMOTE_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Generate Oversampled data using SMOTE - option 1
X_train_SMOTE, y_train_SMOTE = Balance_Data(1)
#Fitting parameters in RandomizedSearchCV
XGBoost_SMOTE_tuned.fit(X_train_SMOTE, y_train_SMOTE)
# Access the best estimator from the RandomizedSearchCV
XGBoost_SMOTE_tuned.best_params_
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099
After Class Balancing, counts of label 'Yes': 5099
After Class Balancing, counts of label 'No': 5099
After Class Balancing, the shape of train_X: (10198, 29)
After Class Balancing, the shape of train_y: (10198,)
{'subsample': 1,
'scale_pos_weight': 15,
'reg_lambda': 10,
'n_estimators': 100,
'max_depth': 2,
'learning_rate': 0.01,
'gamma': 0}
# Initiate XGBoost with the best parameters of RandomizedSearchCV
XGBoost_SMOTE_tuned = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=1,
scale_pos_weight=15,
reg_lambda=10,
n_estimators=100,
max_depth=2,
learning_rate=0.01,
gamma=0,
)
# Fitting
XGBoost_SMOTE_tuned.fit(X_train_SMOTE, y_train_SMOTE)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=0, gpu_id=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=0.01, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=2,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=100, n_jobs=None,
num_parallel_tree=None, predictor=None, random_state=1, ...)
# checking the model performance for Validation set
XGBoost_SMOTE_model_val_perf = model_performance_classification_sklearn(
XGBoost_SMOTE_tuned, X_val, y_val
)
print("XGBoost_SMOTE Validation performance:")
XGBoost_SMOTE_model_val_perf
XGBoost_SMOTE Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.535 | 1.000 | 0.257 | 0.409 |
#######
# We will add the validation set performance scores for the Tuned Models beside the Confidence Intervals
# so that it will be easy for us to compare the results.
#######
tuned_val_perf = []  # create an empty list
# Append the Recall scores for the Tuned Models on validation data
tuned_val_perf.append(bag_ROS_model_val_perf.at[0, "Recall"])
tuned_val_perf.append(rf_ROS_model_val_perf.at[0, "Recall"])
tuned_val_perf.append(gb_ROS_model_val_perf.at[0, "Recall"])
tuned_val_perf.append(XGBoost_ROS_model_val_perf.at[0, "Recall"])
tuned_val_perf.append(rf_SMOTE_model_val_perf.at[0, "Recall"])
tuned_val_perf.append(XGBoost_SMOTE_model_val_perf.at[0, "Recall"])
# Add the list to the CI dataframe
confidence_intervals_df["Tuned_Val_Perf"] = tuned_val_perf
confidence_intervals_df
| | Lower CI | Upper CI | Val_Perf | Tuned_Val_Perf |
|---|---|---|---|---|
| Bagging_ROS | 0.992 | 1.000 | 0.813 | 0.945 |
| RandomForest_ROS | 0.996 | 0.999 | 0.844 | 0.911 |
| GradientBoost_ROS | 0.973 | 0.980 | 0.939 | 0.899 |
| XGBoost_ROS | 0.998 | 1.000 | 0.936 | 1.000 |
| RandomForest_SMOTE | 0.965 | 0.975 | 0.856 | 0.902 |
| XGBoost_SMOTE | 0.974 | 0.983 | 0.893 | 1.000 |
Observations
XGBoost_SMOTE_tuned and XGBoost_ROS_tuned both achieved a Recall score of 1.0 (100%) on the Validation set.
XGBoost_SMOTE_tuned's validation recall lies above the upper bound of its 95% Confidence Interval.
XGBoost_ROS_tuned's validation recall lies within its 95% Confidence Interval.
Bagging_ROS_tuned is our 3rd best model, with a validation Recall score of 0.945 (94.5%), which falls below the lower bound of its 95% Confidence Interval.
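The containment checks above can be automated rather than read off by eye; a minimal sketch on a hand-typed frame (illustrative numbers copied from two rows of the table, not computed here):

```python
import pandas as pd

# Illustrative CI bounds and tuned validation recalls for two models
ci = pd.DataFrame(
    {
        "Lower CI": [0.998, 0.974],
        "Upper CI": [1.000, 0.983],
        "Tuned_Val_Perf": [1.000, 1.000],
    },
    index=["XGBoost_ROS", "XGBoost_SMOTE"],
)

# Flag models whose tuned validation recall reaches at least the lower CI bound
ci["Meets_Lower_CI"] = ci["Tuned_Val_Perf"] >= ci["Lower CI"]
print(ci)
```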
# Checking the model performance for Unseen Test set
XGBoost_ROS_model_test_perf = model_performance_classification_sklearn(
XGBoost_ROS_tuned, X_test, y_test
)
print("XGBoost_ROS_tuned TEST performance:")
XGBoost_ROS_model_test_perf
XGBoost_ROS_tuned TEST performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.160 | 1.000 | 0.160 | 0.276 |
# Confusion matrix
confusion_matrix_sklearn(XGBoost_ROS_tuned, X_test, y_test)
Observations
XGBoost_ROS_tuned achieves a Recall score of 1.0 (100%) on both the Validation and Test sets, and this recall lies within the 95% Confidence Interval.
# Plot the feature importances
feature_names = X_train.columns
importances = XGBoost_ROS_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations
Total_Trans_Ct and Total_Revolving_Bal are the most important features.
# Import shap - helps in visualizing the relationships in the model
import shap
# Initialize the package
shap.initjs()
# calculating SHAP values
explainer = shap.TreeExplainer(XGBoost_ROS_tuned)
shap_values = explainer.shap_values(X_train)
# Make plot.
shap.summary_plot(shap_values, X_train)
Observations
Total_Trans_Ct and Total_Revolving_Bal are the top (and only) two features that contribute to the prediction of the target.
######
# User-defined function to be used by FunctionTransformer when creating the Pipeline.
######
def myProcessingSteps(df):
# -----
# Drop CLIENTNUM
# -----
df.drop(["CLIENTNUM"], axis=1, inplace=True)
# -----
# Replacing the corrupt data containing 'abc' with NaN
# -----
df.replace("abc", np.nan, inplace=True)
# -----
# Perform a log transformation of the numerical columns
# -----
for col in num_features:
df[col] = np.log(df[col] + 1)
# -----
# Minmax scaling of numeric features
# -----
for col in num_features:
df[col] = MinMaxScaler().fit_transform(df[[col]])
return df
Convert myProcessingSteps() into a transformer object that can be integrated into the Pipeline.
func_transform = FunctionTransformer(myProcessingSteps)
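`FunctionTransformer` simply calls the wrapped function inside `transform`; a self-contained sketch on a toy frame (the `log1p_cols` helper here is hypothetical, standing in for the notebook's `myProcessingSteps`):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def log1p_cols(df):
    # Apply log(x + 1) to every column, mirroring the notebook's log step
    return np.log1p(df)

ft = FunctionTransformer(log1p_cols)
toy = pd.DataFrame({"x": [0.0, 1.0, np.e - 1]})
out = ft.fit_transform(toy)
print(out)
```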
Apply different preprocessing steps to different subsets of columns.
# Creating a transformer for numerical variables, which will apply simple imputer on the numerical variables
numeric_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="median")),
]
)
# Creating a transformer for categorical variables, which will first apply simple imputer and
# then do one hot encoding for categorical variables
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="most_frequent")),
("onehot", OneHotEncoder(handle_unknown="ignore")),
]
)
# Combining categorical transformer and numerical transformer using a column transformer
col_transform = ColumnTransformer(
transformers=[
("num", numeric_transformer, num_features),
("cat", categorical_transformer, cat_features),
],
remainder="passthrough",
)
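A minimal end-to-end check of the same imputer-plus-encoder pattern (toy data and column names are assumptions, not the notebook's `num_features`/`cat_features`):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"age": [25.0, np.nan, 40.0], "gender": ["M", "F", np.nan]})

ct = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median"))]), ["age"]),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["gender"]),
    ]
)
# Median fills the missing age; one-hot expands gender into two columns
result = ct.fit_transform(toy)
print(result.shape)
```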
# Check the original dataset
df.head()
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
# Dividing dataset into X and y
X = df.drop(["Attrition_Flag"], axis=1)
y = df["Attrition_Flag"].apply(lambda x: 1 if x == "Attrited Customer" else 0)
# Splitting data into training and test set:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
(7088, 20) (3039, 20)
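`stratify=y` preserves the class ratio in both splits; a quick sketch with a synthetic imbalanced label (roughly 16% positives, similar to the churn data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 16 positives out of 100
y = np.array([1] * 16 + [0] * 84)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
# Both splits keep close to the original 16% positive rate
print(y_tr.mean(), y_te.mean())
```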
churn_pipe = make_imb_pipeline(
func_transform,
col_transform,
RandomOverSampler(random_state=1),
XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=1,
scale_pos_weight=12,
reg_lambda=10,
n_estimators=150,
max_depth=1,
learning_rate=0.01,
gamma=0,
),
)
# Fit the model on training data
churn_pipe.fit(X_train, y_train)
Pipeline(steps=[('functiontransformer',
FunctionTransformer(func=<function myProcessingSteps at 0x7f8f114f9790>)),
('columntransformer',
ColumnTransformer(remainder='passthrough',
transformers=[('num',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='median'))]),
['Customer_Age',
'Months_on_book',
'Total_Relationship_Count',
'Months_Inactive_12_mon',
'Contact...
feature_types=None, gamma=0, gpu_id=None,
grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.01,
max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None,
max_depth=1, max_leaves=None,
min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=150,
n_jobs=None, num_parallel_tree=None,
predictor=None, random_state=1, ...))])
# Calculating different metrics on test set
XGBoost_ROS_tuned_pipeline_perf = model_performance_classification_sklearn(
churn_pipe, X_test, y_test
)
print("Test performance:")
XGBoost_ROS_tuned_pipeline_perf
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.161 | 1.000 | 0.161 | 0.277 |
From the Exploratory Data Analysis done, the following observations and key insights became evident:
The relationship between Avg_Open_To_Buy and Credit_Limit suggests that many customers are not using their cards. Further analysis showed that:
◎ Only a small fraction of customers (0.3%) are active every month.
◎ About 45.3% of customers have not used their cards for 3 months or more in the last 12 months.
Neither tenure (Months_on_book) nor age (Customer_Age) appears to affect Attrition.
Attrited customers tend to have a low Total_Revolving_Bal (the balance that carries over from one month to the next), a low Avg_Utilization_Ratio (how much of the available credit the customer spent), and a low Total_Trans_Ct (total transaction count over the last 12 months).
From our BEST of 30 models, Tuned XGBoost on the RandomOverSampled dataset, the following observations and key insights became evident:
Tuned XGBoost on the RandomOverSampled dataset gives the best RECALL score of 1.0 (100%), with the lowest bias and lowest variance.
This Recall score of 1.0 lies within the 95% Confidence Interval.
From the feature importances of our BEST model, the 2 most important features are Total_Trans_Ct and Total_Revolving_Bal.
Based on the predictive models, analysis, insights, and observations, here are specific business recommendations. The bank should closely monitor:
◎ The Total Transaction Count (Total_Trans_Ct) of customers.
◎ The Total Revolving Balance (Total_Revolving_Bal): the balance that carries over from one month to the next.
◎ The Utilization Ratio (Avg_Utilization_Ratio): how much of the available credit the customer spent.
◎ Avg_Open_To_Buy (the amount left on the credit card to use) relative to Credit_Limit.
Customers in the $120K+ income category should also be focused upon, to enhance their satisfaction and reduce attrition.
Further Analysis and Modeling:
◎ Further analysis and modelling can be done to unearth invaluable information.